Evaluation and selection of models for out-of-sample prediction when the sample size is small relative to the complexity of the data-generating process
In regression with random design, we study the problem of selecting a model
that performs well for out-of-sample prediction. We do not assume that any of
the candidate models under consideration are correct. Our analysis is based on
explicit finite-sample results. Our main findings differ from those of other
analyses that are based on traditional large-sample limit approximations
because we consider a situation where the sample size is small relative to the
complexity of the data-generating process, in the sense that the number of
parameters in a "good" model is of the same order as sample size. Also, we
allow for the case where the number of candidate models is (much) larger than
sample size.
Comment: Published at http://dx.doi.org/10.3150/08-BEJ127 in Bernoulli
(http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)
Conditional predictive inference post model selection
We give a finite-sample analysis of predictive inference procedures after
model selection in regression with random design. The analysis is focused on a
statistically challenging scenario where the number of potentially important
explanatory variables can be infinite, where no regularity conditions are
imposed on unknown parameters, where the number of explanatory variables in a
"good" model can be of the same order as sample size and where the number of
candidate models can be of larger order than sample size. The performance of
inference procedures is evaluated conditional on the training sample. Under
weak conditions on only the number of candidate models and on their complexity,
and uniformly over all data-generating processes under consideration, we show
that a certain prediction interval is approximately valid and short with high
probability in finite samples, in the sense that its actual coverage
probability is close to the nominal one and in the sense that its length is
close to the length of an infeasible interval that is constructed by actually
knowing the "best" candidate model. Similar results are shown to hold for
predictive inference procedures other than prediction intervals like, for
example, tests of whether a future response will lie above or below a given
threshold.
Comment: Published at http://dx.doi.org/10.1214/08-AOS660 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
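The interval studied in the paper is constructed differently, but the basic idea of predictive inference after a data-driven selection step can be illustrated with a minimal split-sample sketch: select a model on one half of the data, calibrate the interval length from held-out absolute residuals on the other half, and predict for a new design point. All names and the toy selector below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, 0.0]) + rng.normal(size=n)

half = n // 2
X_sel, y_sel = X[:half], y[:half]    # used to select and fit a model
X_cal, y_cal = X[half:], y[half:]    # used only to calibrate the interval

def rss_of(j, Xs, ys):
    """Residual sum of squares of a one-regressor least-squares fit."""
    b = np.linalg.lstsq(Xs[:, [j]], ys, rcond=None)[0]
    return float(((ys - Xs[:, [j]] @ b) ** 2).sum())

# Toy model selection: keep the single regressor with the smallest RSS.
j_hat = min(range(p), key=lambda j: rss_of(j, X_sel, y_sel))
beta = np.linalg.lstsq(X_sel[:, [j_hat]], y_sel, rcond=None)[0]

# Calibrate a nominal 90% interval from held-out absolute residuals.
resid = np.abs(y_cal - X_cal[:, [j_hat]] @ beta)
q = float(np.quantile(resid, 0.9))

x_new = rng.normal(size=p)
center = float(x_new[[j_hat]] @ beta)
interval = (center - q, center + q)
```

Because the calibration half never sees the selection step, its residual quantile remains a valid yardstick for the selected model's prediction error, which is the intuition behind evaluating coverage conditional on the training sample.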
The distribution of a linear predictor after model selection: Unconditional finite-sample distributions and asymptotic approximations
We analyze the (unconditional) distribution of a linear predictor that is
constructed after a data-driven model selection step in a linear regression
model. First, we derive the exact finite-sample cumulative distribution
function (cdf) of the linear predictor, and a simple approximation to this
(complicated) cdf. We then analyze the large-sample limit behavior of these
cdfs, in the fixed-parameter case and under local alternatives.
Comment: Published at http://dx.doi.org/10.1214/074921706000000518 in the IMS
Lecture Notes--Monograph Series
(http://www.imstat.org/publications/lecnotes.htm) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Can one estimate the conditional distribution of post-model-selection estimators?
We consider the problem of estimating the conditional distribution of a
post-model-selection estimator where the conditioning is on the selected model.
The notion of a post-model-selection estimator here refers to the combined
procedure resulting from first selecting a model (e.g., by a model selection
criterion such as AIC or by a hypothesis testing procedure) and then estimating
the parameters in the selected model (e.g., by least-squares or maximum
likelihood), all based on the same data set. We show that it is impossible to
estimate this distribution with reasonable accuracy even asymptotically. In
particular, we show that no estimator for this distribution can be uniformly
consistent (not even locally). This follows as a corollary to (local) minimax
lower bounds on the performance of estimators for this distribution. Similar
impossibility results are also obtained for the conditional distribution of
linear functions (e.g., predictors) of the post-model-selection estimator.
Comment: Published at http://dx.doi.org/10.1214/009053606000000821 in the
Annals of Statistics (http://www.imstat.org/aos/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
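The combined procedure the abstract describes can be made concrete with a minimal sketch, assuming Gaussian AIC and least squares as the selection and estimation steps: choose among candidate regressor subsets by AIC, then refit the winner on the same data. The helper names and candidate list are illustrative, not from the paper.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and residual sum of squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = float(((y - X @ beta) ** 2).sum())
    return beta, rss

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant: n*log(rss/n) + 2k."""
    return n * np.log(rss / n) + 2 * k

def post_selection_estimate(X, y, candidate_subsets):
    """Select the subset minimizing AIC, then report its OLS fit --
    both steps use the same data, as in a post-model-selection estimator."""
    n = len(y)
    best = min(
        candidate_subsets,
        key=lambda s: aic(fit_ols(X[:, s], y)[1], n, len(s)),
    )
    beta, _ = fit_ols(X[:, best], y)
    return best, beta

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X[:, 0] * 2.0 + rng.normal(size=n)   # only the first regressor matters
subsets = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
chosen, beta_hat = post_selection_estimate(X, y, subsets)
```

The impossibility result concerns the distribution of `beta_hat` conditional on which subset was chosen; because the selection event depends on the same data as the fit, that conditional distribution cannot be uniformly consistently estimated.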